Capstone #1

File Descriptions

  • DATA_ELEMENT_DESCRIPTION.csv defines each data element and indicates where its description is found in Data Sources, Definitions, and Notes.
  • DEFINED_DATA_VALUE.csv defines the meaning of specific values (such as missing or suppressed data).
  • DEMOGRAPHICS.csv identifies the data elements and values in the Demographics indicator domain.
  • HEALTHY_PEOPLE_2010.csv identifies the Healthy People 2010 Targets and the U.S. Percentages or Rates.
  • LEADING_CAUSES_OF_DEATH.csv identifies the data elements and values in the Leading Causes of Death indicator domain.
  • MEASURES_OF_BIRTH_AND_DEATH.csv identifies the data elements and values in the Measures of Birth and Death indicator domain.
  • PREVENTIVE_SERVICES_USE.csv identifies the data elements and values in the Preventive Services indicator domain.
  • RELATIVE_HEALTH_IMPORTANCE.csv identifies the data elements and values in the Relative Health Importance indicator domain.
  • RISK_FACTORS_AND_ACCESS_TO_CARE.csv identifies the data elements and values in the Risk Factors and Access to Care indicator domain.
  • SUMMARY_MEASURES_OF_HEALTH.csv identifies the data elements and values in the Summary Measures of Health indicator domain.
  • VULNERABLE_POPS_AND_ENV_HEALTH.csv identifies the data elements and values in the Vulnerable Populations and Environmental Health indicator domain.

Data source can be found here

In [1]:
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.cm as cm
from matplotlib.pyplot import figaspect
from mpl_toolkits.axes_grid1 import make_axes_locatable
import geopandas as gpd
import seaborn as sns
from scipy.stats.stats import pearsonr
mpl.rcParams.update({'font.size': 14})
In [2]:
#Reading all 11 csv files
files = sorted(glob.glob('../CHSI_dataset/*.csv'))  

File Initialization

In [3]:
DATA_ELEMENT_DESCRIPTION = pd.read_csv(files[0])
DEFINED_DATA_VALUE = pd.read_csv(files[1])
DEMOGRAPHICS = pd.read_csv(files[2])
HEALTHY_PEOPLE_2010 = pd.read_csv(files[3])
LEADING_CAUSES_OF_DEATH = pd.read_csv(files[4])
MEASURES_OF_BIRTH_AND_DEATH = pd.read_csv(files[5])
PREVENTIVE_SERVICES_USE = pd.read_csv(files[6])
RELATIVE_HEALTH_IMPORTANCE = pd.read_csv(files[7])
RISK_FACTORS_AND_ACCESS_TO_CARE = pd.read_csv(files[8])
SUMMARY_MEASURES_OF_HEALTH = pd.read_csv(files[9])
VULNERABLE_POPS_AND_ENV_HEALTH = pd.read_csv(files[10])
In [4]:
#Auxiliary files 
DATA_ELEMENT_DESCRIPTION
DEFINED_DATA_VALUE
HEALTHY_PEOPLE_2010
Out[4]:
Categories Elements US_Pct_or_Rate_2003 Healthy_People_2010_Target
0 Birth Measures Low Birth Wt. (<2500 g) 7.9 5.0
1 Birth Measures Very Low Birth Wt. (<1500 g) 1.4 0.9
2 Birth Measures Premature Births (<37 weeks) 12.3 7.6
3 Birth Measures Births to Women under 18 3.4 -9998.9
4 Birth Measures Births to Women over 40 2.6 -9998.9
5 Birth Measures Births to Unmarried Women 34.6 -9998.9
6 Birth Measures No Care in First Trimester 16.0 10.0
7 Infant Mortality Infant Mortality 6.8 4.5
8 Infant Mortality White non Hispanic Infant Mortality 5.7 4.5
9 Infant Mortality Black non Hispanic Infant Mortality 13.6 4.5
10 Infant Mortality Hispanic Infant Mortality 5.6 4.5
11 Infant Mortality Neonatal Infant Mortality 4.6 2.9
12 Infant Mortality Post-neonatal Infant Mortality 2.2 1.2
13 Death Measures Breast Cancer (Female) 25.3 21.3
14 Death Measures Colon Cancer 19.1 13.7
15 Death Measures Coronary Heart Disease 172.0 162.0
16 Death Measures Homicide 6.0 2.8
17 Death Measures Lung Cancer 54.1 43.3
18 Death Measures Motor Vehicle Injuries 14.8 8.0
19 Death Measures Stroke 53.0 50.0
20 Death Measures Suicide 10.8 4.8
21 Death Measures Unintentional Injury 37.3 17.1
In [5]:
LEADING_CAUSES_OF_DEATH.head()
Out[5]:
State_FIPS_Code County_FIPS_Code CHSI_County_Name CHSI_State_Name CHSI_State_Abbr Strata_ID_Number A_Wh_Comp CI_Min_A_Wh_Comp CI_Max_A_Wh_Comp A_Bl_Comp ... F_Bl_Cancer CI_Min_F_Bl_Cancer CI_Max_F_Bl_Cancer F_Ot_Cancer CI_Min_F_Ot_Cancer CI_Max_F_Ot_Cancer F_Hi_Cancer CI_Min_F_Hi_Cancer CI_Max_F_Hi_Cancer LCD_Time_Span
0 1 1 Autauga Alabama AL 29 -1111 -1111 -1111 -1111 ... 19 14 23 -1111 -1111 -1111 -1111 -1111 -1111 1999-2003
1 1 3 Baldwin Alabama AL 16 57 39 75 -1111 ... 20 15 25 -1111 -1111 -1111 -1111 -1111 -1111 2001-2003
2 1 5 Barbour Alabama AL 51 -1111 -1111 -1111 -1111 ... 26 22 31 -1111 -1111 -1111 -1111 -1111 -1111 1999-2003
3 1 7 Bibb Alabama AL 42 -1111 -1111 -1111 -1111 ... 20 14 25 -1111 -1111 -1111 -1111 -1111 -1111 1994-2003
4 1 9 Blount Alabama AL 28 34 17 52 -1111 ... 28 10 46 -1111 -1111 -1111 -1111 -1111 -1111 1999-2003

5 rows × 235 columns

In [6]:
#Outlier column name
DATA_ELEMENT_DESCRIPTION.update(
    DATA_ELEMENT_DESCRIPTION.replace('C_Ot_homicide','C_Ot_Homicide'))
LEADING_CAUSES_OF_DEATH.rename(columns={'C_Ot_homicide':'C_Ot_Homicide'}, inplace=True)

#Functions to describe acronyms
def description(param):
    if param in list(DATA_ELEMENT_DESCRIPTION.COLUMN_NAME):
        description_value = DATA_ELEMENT_DESCRIPTION[DATA_ELEMENT_DESCRIPTION.
                                     COLUMN_NAME==param].DESCRIPTION.tolist()[0]
    else:
        description_value = ''   
    return description_value

def short_description(param):
    list_of_removing = ['County data,', 'death measures,', 'birth measures,', ',']
    result = description(param)
    for i in list_of_removing:
        if result.__contains__(i):
            result = result.replace(i,'')
    return result

def short_race_description(param):
    list_of_removing = ['Black', 'Hispanic','White', 'other', 
                        'County data,', 'death measures,', 'birth measures,', ',']
    result = description(param)
    for i in list_of_removing:
        if result.__contains__(i):
            result = result.replace(i,'')
    return result
In [7]:
#Common columns in all worksheets
county_info = ['State_FIPS_Code', 'County_FIPS_Code', 'CHSI_County_Name',
               'CHSI_State_Name', 'CHSI_State_Abbr', 'Strata_ID_Number']
In [8]:
#Joining files
worksheets = [DEMOGRAPHICS,
LEADING_CAUSES_OF_DEATH,
MEASURES_OF_BIRTH_AND_DEATH, 
PREVENTIVE_SERVICES_USE,
RELATIVE_HEALTH_IMPORTANCE,
RISK_FACTORS_AND_ACCESS_TO_CARE,
SUMMARY_MEASURES_OF_HEALTH,
VULNERABLE_POPS_AND_ENV_HEALTH]
In [9]:
#Replacing not available/reported data by zeros
nan_values = [-1111, -1111.1, -1, -9999, -2222, -2222.2, -2]
for worksheet in worksheets:
    nda =[]
    for i,nan_value in enumerate(nan_values):
        nda.append(worksheet.loc[:,:]!=nan_value)
        worksheet.update(worksheet.where(nda[i]).fillna(0))        
In [10]:
#Time dependent data should be identified and normalized
time_dependent_worksheets = [LEADING_CAUSES_OF_DEATH,
MEASURES_OF_BIRTH_AND_DEATH,
PREVENTIVE_SERVICES_USE,
VULNERABLE_POPS_AND_ENV_HEALTH]


time_spans = ['1994-2003', '1999-2003', '2001-2003']
time_span_convert = [7,4,2]
for worksheet in time_dependent_worksheets:
    for i in list(worksheet.columns):
        if i.__contains__('Time_Span'):
            worksheet[i].replace(time_spans, time_span_convert, inplace=True)
            worksheet.rename(columns={i: 'Time'}, inplace=True)

for worksheet in time_dependent_worksheets:
    for i in list(worksheet.columns):
        if worksheet.columns.get_loc(i) < worksheet.columns.get_loc('Time') and i not in county_info:
            worksheet[i]=worksheet[i]/worksheet['Time']
          
In [11]:
#Merging all clean data files
df = worksheets[0]
for name in worksheets[1:]:
    df = df.merge(name, on=county_info.append('Time'), how='outer', sort=True).fillna(0)    
In [12]:
df.head()
Out[12]:
State_FIPS_Code County_FIPS_Code CHSI_County_Name CHSI_State_Name CHSI_State_Abbr Strata_ID_Number Strata_Determining_Factors Number_Counties Population_Size Min_Population_Size ... Shig_Rpt Shig_Rpt_Ind Shig_Exp Toxic_Chem Carbon_Monoxide_Ind Nitrogen_Dioxide_Ind Sulfur_Dioxide_Ind Ozone_Ind Particulate_Matter_Ind Lead_Ind
0 1 1 Autauga Alabama AL 29 frontier status, population size, poverty, age 37 48612 28447 ... 1.000000 0.750000 2.750000 720799.250000 0.250000 0.250000 0.250000 0.250000 0.250000 0.250000
1 1 3 Baldwin Alabama AL 16 frontier status, population size, poverty, age 27 162586 118395 ... 20.500000 2.000000 6.500000 17832.000000 0.500000 0.500000 0.500000 1.000000 0.500000 0.500000
2 1 5 Barbour Alabama AL 51 frontier status, population size, poverty, age... 33 28414 27269 ... 0.250000 0.750000 4.000000 12468.500000 0.250000 0.250000 0.250000 0.250000 0.250000 0.250000
3 1 7 Bibb Alabama AL 42 frontier status, population size, poverty, age 53 21516 8134 ... 0.428571 0.428571 2.285714 1043.142857 0.142857 0.142857 0.142857 0.142857 0.142857 0.142857
4 1 9 Blount Alabama AL 28 frontier status, population size, poverty, age 39 55725 29009 ... 2.750000 1.000000 1.250000 222255.000000 0.250000 0.250000 0.250000 0.250000 0.250000 0.250000

5 rows × 533 columns

Redundant columns

In [13]:
observation = list(df.columns)
description_list = [description(i) for i in observation]
def observ_keyword(variable):
    results = []
    for i, j in enumerate(description_list):
        if variable in str(j):
            results.append(observation[i])
    return results
In [14]:
filter_percentile = list(description(i) for i in list(observ_keyword('percentile')))
filter_CI = list(description(i) for i in list(observ_keyword('Confidence interval'))) 
filter_fav = list(description(i) for i in list(observ_keyword('Favorable indicator')))

filters = list(observ_keyword('Favorable indicator') 
               + observ_keyword('Confidence interval') 
               + observ_keyword('percentile'))
df = df.drop(columns=filters)
In [15]:
#Neutral columns
for i in list(df.columns):
    if len(np.unique(df[i]))==1 and i!='D_Ot_HIV':
        df=df.drop(columns=i)
df.head()
Out[15]:
State_FIPS_Code County_FIPS_Code CHSI_County_Name CHSI_State_Name CHSI_State_Abbr Strata_ID_Number Strata_Determining_Factors Number_Counties Population_Size Population_Density ... Salm_Exp Shig_Rpt Shig_Exp Toxic_Chem Carbon_Monoxide_Ind Nitrogen_Dioxide_Ind Sulfur_Dioxide_Ind Ozone_Ind Particulate_Matter_Ind Lead_Ind
0 1 1 Autauga Alabama AL 29 frontier status, population size, poverty, age 37 48612 82.0 ... 7.750000 1.000000 2.750000 720799.250000 0.250000 0.250000 0.250000 0.250000 0.250000 0.250000
1 1 3 Baldwin Alabama AL 16 frontier status, population size, poverty, age 27 162586 102.0 ... 33.500000 20.500000 6.500000 17832.000000 0.500000 0.500000 0.500000 1.000000 0.500000 0.500000
2 1 5 Barbour Alabama AL 51 frontier status, population size, poverty, age... 33 28414 32.0 ... 7.250000 0.250000 4.000000 12468.500000 0.250000 0.250000 0.250000 0.250000 0.250000 0.250000
3 1 7 Bibb Alabama AL 42 frontier status, population size, poverty, age 53 21516 35.0 ... 4.571429 0.428571 2.285714 1043.142857 0.142857 0.142857 0.142857 0.142857 0.142857 0.142857
4 1 9 Blount Alabama AL 28 frontier status, population size, poverty, age 39 55725 86.0 ... 7.750000 2.750000 1.250000 222255.000000 0.250000 0.250000 0.250000 0.250000 0.250000 0.250000

5 rows × 198 columns

In [16]:
df.describe().transpose()
Out[16]:
count mean std min 25% 50% 75% max
State_FIPS_Code 3141.0 30.304680 1.513442e+01 1.000000 18.000000 29.000000 45.000000 5.600000e+01
County_FIPS_Code 3141.0 103.716651 1.079995e+02 1.000000 35.000000 79.000000 133.000000 8.400000e+02
Strata_ID_Number 3141.0 44.696275 2.511843e+01 1.000000 23.000000 44.000000 66.000000 8.800000e+01
Number_Counties 3141.0 38.486151 1.029019e+01 15.000000 32.000000 37.000000 45.000000 6.200000e+01
Population_Size 3141.0 94368.164279 3.064317e+05 62.000000 11211.000000 25235.000000 64040.000000 9.935475e+06
Population_Density 3141.0 249.826807 1.702476e+03 0.000000 17.000000 44.000000 109.000000 6.939000e+04
Poverty 3141.0 13.345909 4.888337e+00 0.000000 9.800000 12.600000 16.200000 3.620000e+01
Age_19_Under 3141.0 24.806527 3.281777e+00 1.400000 22.700000 24.600000 26.400000 4.720000e+01
Age_19_64 3141.0 60.289398 3.356056e+00 47.600000 58.300000 60.300000 62.300000 8.330000e+01
Age_65_84 3141.0 12.789430 3.334035e+00 2.100000 10.700000 12.500000 14.700000 2.920000e+01
Age_85_and_Over 3141.0 2.115409 9.491186e-01 0.100000 1.500000 1.900000 2.600000 7.600000e+00
White 3141.0 87.017892 1.615048e+01 4.700000 82.800000 94.100000 97.600000 1.000000e+02
Black 3141.0 8.986692 1.454566e+01 0.000000 0.500000 2.100000 10.300000 8.600000e+01
Native_American 3141.0 1.974244 7.675445e+00 0.000000 0.200000 0.400000 0.900000 9.340000e+01
Asian 3141.0 1.123050 2.757237e+00 0.000000 0.300000 0.500000 1.000000 5.590000e+01
Hispanic 3141.0 7.017988 1.246473e+01 0.000000 1.100000 2.300000 6.300000 9.750000e+01
A_Wh_Comp 3141.0 5.194831 9.009938e+00 0.000000 0.000000 0.000000 8.250000 3.850000e+01
A_Bl_Comp 3141.0 2.484343 7.720221e+00 0.000000 0.000000 0.000000 0.000000 3.900000e+01
A_Ot_Comp 3141.0 0.252683 2.456387e+00 0.000000 0.000000 0.000000 0.000000 3.350000e+01
A_Hi_Comp 3141.0 1.156956 5.311160e+00 0.000000 0.000000 0.000000 0.000000 3.650000e+01
A_Wh_BirthDef 3141.0 2.590099 4.600485e+00 0.000000 0.000000 0.000000 4.250000 2.850000e+01
A_Bl_BirthDef 3141.0 0.432335 1.659780e+00 0.000000 0.000000 0.000000 0.000000 1.500000e+01
A_Ot_BirthDef 3141.0 0.109804 1.143808e+00 0.000000 0.000000 0.000000 0.000000 2.350000e+01
A_Hi_BirthDef 3141.0 0.550007 2.628706e+00 0.000000 0.000000 0.000000 0.000000 2.600000e+01
B_Wh_Injury 3141.0 1.766601 5.279414e+00 0.000000 0.000000 0.000000 0.000000 3.350000e+01
B_Bl_Injury 3141.0 0.475201 2.769970e+00 0.000000 0.000000 0.000000 0.000000 3.000000e+01
B_Ot_Injury 3141.0 0.090679 1.249387e+00 0.000000 0.000000 0.000000 0.000000 2.750000e+01
B_Hi_Injury 3141.0 0.261223 2.136312e+00 0.000000 0.000000 0.000000 0.000000 2.750000e+01
B_Wh_Cancer 3141.0 0.582583 2.194628e+00 0.000000 0.000000 0.000000 0.000000 1.750000e+01
B_Bl_Cancer 3141.0 0.080388 7.550155e-01 0.000000 0.000000 0.000000 0.000000 1.200000e+01
... ... ... ... ... ... ... ... ...
Elderly_Medicare 3141.0 10900.620503 3.137918e+04 0.000000 1607.000000 3448.000000 8053.000000 9.115900e+05
Disabled_Medicare 3141.0 1881.088188 4.684165e+03 0.000000 275.000000 690.000000 1667.000000 1.181900e+05
Prim_Care_Phys_Rate 3141.0 57.562464 4.479377e+01 0.000000 30.500000 50.600000 74.700000 5.812000e+02
Dentist_Rate 3141.0 32.154569 2.150797e+01 0.000000 18.700000 30.000000 43.300000 3.779000e+02
Community_Health_Center_Ind 3141.0 1.550780 4.974939e-01 1.000000 1.000000 2.000000 2.000000 2.000000e+00
HPSA_Ind 3141.0 1.758994 4.277621e-01 1.000000 2.000000 2.000000 2.000000 2.000000e+00
ALE 3141.0 76.274276 2.774060e+00 0.000000 75.000000 76.500000 77.700000 8.130000e+01
US_ALE 3141.0 76.451289 1.930071e+00 0.000000 76.500000 76.500000 76.500000 7.650000e+01
All_Death 3141.0 904.816874 1.342697e+02 0.000000 814.000000 898.500000 989.800000 1.869600e+03
US_All_Death 3141.0 898.313913 1.603365e+01 0.000000 898.600000 898.600000 898.600000 8.986000e+02
Health_Status 3141.0 13.656033 8.901317e+00 0.000000 8.900000 14.600000 19.400000 4.770000e+01
Unhealthy_Days 3141.0 5.046864 2.614327e+00 0.000000 4.500000 5.700000 6.600000 1.260000e+01
No_HS_Diploma 3141.0 5034.707964 2.393183e+04 0.000000 253.571429 706.000000 2269.500000 9.361580e+05
Unemployed 3141.0 1029.821940 4.033053e+03 0.000000 41.000000 111.285714 417.500000 1.282360e+05
Sev_Work_Disabled 3141.0 1284.129019 1.394675e+04 0.000000 42.428571 128.714286 465.750000 7.371660e+05
Major_Depression 3141.0 2317.251535 8.232223e+03 0.571429 96.142857 229.714286 970.000000 2.479525e+05
Recent_Drug_Use 3141.0 2313.389582 9.424556e+03 0.285714 78.285714 211.571429 893.250000 3.326960e+05
Ecol_Rpt 3141.0 1.518579 4.496007e+00 0.000000 0.000000 0.250000 1.000000 1.010000e+02
Ecol_Exp 3141.0 1.653579 4.624199e+00 0.000000 0.142857 0.428571 1.250000 1.465000e+02
Salm_Rpt 3141.0 17.183552 4.966769e+01 0.000000 1.500000 4.000000 10.750000 9.850000e+02
Salm_Exp 3141.0 17.140879 4.521148e+01 0.000000 2.142857 4.750000 12.000000 1.317500e+03
Shig_Rpt 3141.0 9.114784 4.087135e+01 0.000000 0.142857 0.750000 3.285714 1.013500e+03
Shig_Exp 3141.0 8.994747 2.840226e+01 0.000000 0.714286 1.714286 5.000000 7.320000e+02
Toxic_Chem 3141.0 428237.957566 2.258974e+06 0.000000 0.000000 9449.750000 156429.500000 7.124991e+07
Carbon_Monoxide_Ind 3141.0 0.235435 1.263729e-01 0.142857 0.142857 0.142857 0.250000 1.000000e+00
Nitrogen_Dioxide_Ind 3141.0 0.235275 1.257227e-01 0.142857 0.142857 0.142857 0.250000 5.000000e-01
Sulfur_Dioxide_Ind 3141.0 0.235275 1.257227e-01 0.142857 0.142857 0.142857 0.250000 5.000000e-01
Ozone_Ind 3141.0 0.263565 2.024743e-01 0.142857 0.142857 0.142857 0.250000 1.000000e+00
Particulate_Matter_Ind 3141.0 0.237595 1.340617e-01 0.142857 0.142857 0.142857 0.250000 1.000000e+00
Lead_Ind 3141.0 0.235435 1.263729e-01 0.142857 0.142857 0.142857 0.250000 1.000000e+00

194 rows × 8 columns

Demographic Map

To visualize the health indices throughout the country, I merge the map shapefile and data to us_merge dataframes. Because Alaska and Hawaii are further away and out of proportion I merged them separately. US_plot function creates the US map with any numeric variables in the data --default color is blue.

In [17]:
us_map = gpd.read_file('../US-map/UScounties.shp')
us_map_land = us_map[ (us_map['STATE_NAME']!='Alaska') 
                     &(us_map['STATE_NAME']!='Hawaii')]
us_map_Alaska = us_map[us_map['STATE_NAME']=='Alaska']
us_map_Hawaii = us_map[us_map['STATE_NAME']=='Hawaii']

#Merging map and data:
df_land = us_map_land.set_index('NAME').join(df.set_index('CHSI_County_Name'))
df_Alaska = us_map_Alaska.set_index('NAME').join(df.set_index('CHSI_County_Name'))
df_Hawaii = us_map_Hawaii.set_index('NAME').join(df.set_index('CHSI_County_Name'))
In [18]:
def min_max_col(variable):
    minimum = min(df_land[variable].min(), 
              df_Alaska[variable].min(), df_Hawaii[variable].min())
    maximum = max(df_land[variable].max(), 
              df_Alaska[variable].max(), df_Hawaii[variable].max())
    return minimum, maximum

def US_plot(variable, color='Blues'):
    h, w = figaspect(1.)
    mn, mx = min_max_col(variable)
    fig, ax1 = plt.subplots(1, figsize=(w*10, h*10))
    ax1.axis('off')
    ax1.set_title(description(variable), fontsize=40)
    divider = make_axes_locatable(ax1)
    cax = divider.append_axes('right', size='2%', pad=0.1)
    cax.tick_params(labelsize=25)
    df_land.plot(column=variable, cmap=color,
                        linewidth=0.8, ax=ax1,
                        edgecolor='0.8', vmin=mn ,vmax=mx,
                        legend=True, cax=cax)

    
    left, bottom, width, height = [0.15, 0.22, 0.24, 0.3] 
    ax2 = fig.add_axes([left, bottom, width, height])
    ax2.axis('off')
    ax2.set_title('Alaska', fontsize=25)
    df_Alaska.plot(column=variable, cmap=color, vmin=mn ,vmax=mx,
                          linewidth=0.8, ax=ax2, edgecolor='0.6')
    
    left, bottom, width, height = [0.35, 0.3, 0.2, 0.05]
    ax3 = fig.add_axes([left, bottom, width, height])
    ax3.axis('off')
    ax3.set_title('Hawaii', fontsize=25)
    df_Hawaii.plot(column =variable, cmap=color, vmin=mn ,vmax=mx,
                          linewidth=0.8, ax=ax3, edgecolor='0.6')
In [19]:
US_plot('Suicide', 'Greys')
In [20]:
US_plot('CHD', 'Reds')

Measures of Birth and Death

In [21]:
Birth_Death_list = list(np.intersect1d(df.columns, MEASURES_OF_BIRTH_AND_DEATH.columns))
Birth = []
Death = []
for i in Birth_Death_list:
    if i not in county_info:
        if description(i).__contains__('birth'):
            Birth.append(i)
        else:
            Death.append(i)
In [50]:
Death_Mean = {'Cause_of_Death':Death, 'Mean_Death':[np.mean(df[i]) for i in Death]}
Birth_Mean = {'Birth':Birth, 'Mean_Birth':[np.mean(df[i]) for i in Birth]}

df_death_mean = pd.DataFrame(Death_Mean)
df_death_mean = df_death_mean.drop(df_death_mean[df_death_mean['Cause_of_Death'] =='Total_Deaths'].index, axis=0)
df_death_mean = df_death_mean.sort_values(by=['Mean_Death'], ascending=False)

df_birth_mean = pd.DataFrame(Birth_Mean)
df_birth_mean = df_birth_mean.drop(df_birth_mean[df_birth_mean['Birth'] =='Total_Births'].index, axis=0)
df_birth_mean = df_birth_mean.sort_values(by=['Mean_Birth'], ascending=False)
In [51]:
fig, ax = plt.subplots(1, figsize=(5, 5))
labels=list(short_description(i) for i in df_death_mean.Cause_of_Death)
plt.barh(labels, df_death_mean.Mean_Death, color='brown', height=0.9)
plt.xlabel('Mean of death in each county', fontsize=12)
plt.title('Cause of death rates', y=0.845, fontsize=14, 
          bbox=dict(facecolor='none', edgecolor='k', boxstyle='round,pad=0.5'))
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
In [53]:
fig, ax = plt.subplots(1, figsize=(4, 4))
labels=list(short_description(i) for i in df_birth_mean.Birth)
plt.barh(labels, df_birth_mean.Mean_Birth, color='b', height=0.9)
plt.xlabel('Mean of birth on each county', fontsize=12)
plt.title('Birth conditions', y=0.872, fontsize=14,
         bbox=dict(facecolor='none', edgecolor='k', boxstyle='round,pad=0.5'))
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
In [25]:
Death_race_list = list(np.intersect1d(df.columns, LEADING_CAUSES_OF_DEATH.columns))
Black = []
Hispanic = []
White = []
other = []
for i in Death_race_list:
    if i not in county_info:
        if description(i).__contains__('Black'):
            Black.append(i)
        elif description(i).__contains__('Hispanic'):
            Hispanic.append(i)
        elif description(i).__contains__('White'):
            White.append(i)
        else:
            other.append(i)
    
In [26]:
Black_up_24=[]
Black_25=[]
Hispanic_up_24=[]
Hispanic_25=[]
White_up_24=[]
White_25=[]
other_up_24=[]
other_25=[]
for i in Black:
    if description(i).__contains__('25-44') or description(i).__contains__('45-64') or description(i).__contains__('65+'):
        Black_25.append(i)
    else:
        Black_up_24.append(i)    
for i in Hispanic:
    if description(i).__contains__('25-44') or description(i).__contains__('45-64') or description(i).__contains__('65+'):
        Hispanic_25.append(i)
    else:
        Hispanic_up_24.append(i)
for i in White:
    if description(i).__contains__('25-44') or description(i).__contains__('45-64') or description(i).__contains__('65+'):
        White_25.append(i)
    else:
        White_up_24.append(i)            
for i in other:
    if description(i).__contains__('25-44') or description(i).__contains__('45-64') or description(i).__contains__('65+'):
        other_25.append(i)
    else:
        other_up_24.append(i) 
In [27]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, figsize=(25, 25), sharey=True)
fig.subplots_adjust(hspace=-0.1)
cs=cm.Pastel1(np.arange(100))
mpl.rcParams['font.size'] = 15
fig.suptitle('Cause of Death, age under 25: Different Races', y=0.95, fontsize=30)
ax1.set_title('Black', fontsize=25)
labels=list(short_race_description(i) for i in Black_up_24)
ax1.pie(list(np.mean(df[i]) for i in Black_up_24),
        labels=labels, colors=cs, autopct='%1.1f%%', radius=0.8)

ax2.set_title('Hispanic', fontsize=25)
labels=list(short_race_description(i) for i in Hispanic_up_24)
ax2.pie(list(np.mean(df[i]) for i in Hispanic_up_24),
        labels=labels, colors=cs, autopct='%1.1f%%', radius=0.8)

ax3.set_title('White', fontsize=25)
labels=list(short_race_description(i) for i in White_up_24)
ax3.pie(list(np.mean(df[i]) for i in White_up_24),
        labels=labels, colors=cs, autopct='%1.1f%%', radius=0.8)

ax4.set_title('Other', fontsize=25)
labels=list(short_race_description(i) for i in other_up_24)
ax4.pie(list(np.mean(df[i])*2 for i in other_up_24),
        labels=labels, colors=cs, autopct='%1.1f%%', radius=0.8)

plt.show()
In [45]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, figsize=(25, 25), sharey=True)
fig.subplots_adjust(hspace=-0.1)
cs=cm.Set3(np.arange(100))
mpl.rcParams['font.size'] = 15
fig.suptitle('Cause of Death, age 25+: Different Races', y=0.95, fontsize=30)
ax1.set_title('Black', fontsize=25)
labels=list(short_race_description(i) for i in Black_25)
ax1.pie(list(np.mean(df[i]) for i in Black_25),
        labels=labels, colors=cs, autopct='%1.1f%%', radius=0.8)

ax2.set_title('Hispanic', fontsize=25)
labels=list(short_race_description(i) for i in Hispanic_25)
ax2.pie(list(np.mean(df[i]) for i in Hispanic_25),
        labels=labels, colors=cs, autopct='%1.1f%%', radius=0.8)

ax3.set_title('White', fontsize=25)
labels=list(short_race_description(i) for i in White_25)
ax3.pie(list(np.mean(df[i]) for i in White_25),
        labels=labels, colors=cs, autopct='%1.1f%%', radius=0.8)

ax4.set_title('Other', fontsize=25)
labels=list(short_race_description(i) for i in other_25)
ax4.pie(list(np.mean(df[i]) for i in other_25),
        labels=labels, colors=cs, autopct='%1.1f%%', radius=0.8)

plt.show()
In [30]:
df.Mammogram
US_plot('Mammogram')
In [31]:
US_plot('Brst_Cancer')

Correlations between Variables

The following table shows the Pearson correlation between variables:

In [54]:
df.corr(method ='pearson') 
Out[54]:
State_FIPS_Code County_FIPS_Code Strata_ID_Number Number_Counties Population_Size Population_Density Poverty Age_19_Under Age_19_64 Age_65_84 ... Salm_Exp Shig_Rpt Shig_Exp Toxic_Chem Carbon_Monoxide_Ind Nitrogen_Dioxide_Ind Sulfur_Dioxide_Ind Ozone_Ind Particulate_Matter_Ind Lead_Ind
State_FIPS_Code 1.000000 0.166480 0.000187 0.027089 -0.057695 0.023598 -0.020754 -0.077999 -0.037598 0.098458 ... -0.063681 -0.006557 -0.061807 -0.013879 -0.016436 -0.014068 -0.014068 -0.043496 -0.027528 -0.014104
County_FIPS_Code 0.166480 1.000000 0.044329 0.028094 -0.051829 0.052212 0.140420 0.127726 -0.110678 -0.004368 ... -0.055910 0.029384 -0.034715 -0.019483 -0.075387 -0.075417 -0.075417 -0.078794 -0.073277 -0.075084
Strata_ID_Number 0.000187 0.044329 1.000000 -0.016999 -0.385595 -0.185758 0.276217 -0.147028 -0.329400 0.388108 ... -0.465205 -0.297500 -0.419904 -0.155351 -0.793963 -0.795967 -0.795967 -0.683131 -0.770920 -0.793512
Number_Counties 0.027089 0.028094 -0.016999 1.000000 -0.085950 -0.031292 -0.108006 -0.073129 -0.075331 0.105265 ... -0.107958 -0.062619 -0.088583 -0.068168 -0.128768 -0.129497 -0.129497 -0.104967 -0.129573 -0.126564
Population_Size -0.057695 -0.051829 -0.385595 -0.085950 1.000000 0.335231 -0.055041 0.126784 0.104946 -0.193847 ... 0.978261 0.523775 0.901652 0.204323 0.477616 0.477760 0.477760 0.522135 0.509329 0.475792
Population_Density 0.023598 0.052212 -0.185758 -0.031292 0.335231 1.000000 0.016395 0.004501 0.098633 -0.091126 ... 0.342724 0.277823 0.335209 0.039236 0.221469 0.222360 0.222360 0.195464 0.213169 0.221272
Poverty -0.020754 0.140420 0.276217 -0.108006 -0.055041 0.016395 1.000000 0.259603 -0.204728 -0.005279 ... -0.054216 0.029800 0.015400 -0.022027 -0.188169 -0.189415 -0.189415 -0.169929 -0.166007 -0.189819
Age_19_Under -0.077999 0.127726 -0.147028 -0.073129 0.126784 0.004501 0.259603 1.000000 -0.232767 -0.611953 ... 0.151166 0.131822 0.161122 0.134323 0.184280 0.185081 0.185081 0.181667 0.197109 0.184625
Age_19_64 -0.037598 -0.110678 -0.329400 -0.075331 0.104946 0.098633 -0.204728 -0.232767 1.000000 -0.613342 ... 0.122207 0.053788 0.093824 0.018651 0.248541 0.249407 0.249407 0.224395 0.233078 0.249517
Age_65_84 0.098458 -0.004368 0.388108 0.105265 -0.193847 -0.091126 -0.005279 -0.611953 -0.613342 1.000000 ... -0.227688 -0.151378 -0.211261 -0.120355 -0.352608 -0.354018 -0.354018 -0.331683 -0.350170 -0.353629
Age_85_and_Over 0.055528 -0.033304 0.308272 0.149250 -0.128797 -0.043721 -0.154717 -0.484264 -0.575900 0.771622 ... -0.154897 -0.114786 -0.146075 -0.108127 -0.275947 -0.276819 -0.276819 -0.255826 -0.274541 -0.277009
White 0.133447 -0.075910 0.050902 0.097854 -0.129129 -0.131887 -0.500022 -0.369316 -0.042609 0.319331 ... -0.162362 -0.172070 -0.185727 -0.096732 -0.124132 -0.122404 -0.122404 -0.136416 -0.118568 -0.120972
Black -0.120601 0.114684 -0.067956 -0.068513 0.080557 0.105917 0.468740 0.211044 0.064710 -0.207554 ... 0.112266 0.157999 0.140907 0.042701 0.091464 0.089132 0.089132 0.101520 0.083280 0.087972
Native_American -0.012413 -0.037041 0.170212 -0.048394 -0.040336 -0.026498 0.226342 0.345781 -0.146041 -0.155817 ... -0.047572 -0.010939 -0.030453 0.091867 -0.085691 -0.085841 -0.085841 -0.071318 -0.079333 -0.085675
Asian -0.085615 -0.030404 -0.298746 -0.051751 0.371919 0.251575 -0.147999 0.033618 0.254596 -0.251417 ... 0.405557 0.160633 0.352655 0.058135 0.355381 0.357229 0.357229 0.349980 0.351114 0.355152
Hispanic 0.089078 0.186551 -0.000010 -0.090372 0.177456 0.080011 0.261863 0.377215 -0.166735 -0.155450 ... 0.188614 0.128523 0.209912 0.020591 0.107581 0.108616 0.108616 0.128345 0.150110 0.107469
A_Wh_Comp -0.029312 -0.058812 -0.676653 -0.091921 0.487843 0.246776 -0.184655 0.185239 0.202315 -0.318840 ... 0.589867 0.380722 0.536526 0.202953 0.848863 0.850609 0.850609 0.780980 0.826973 0.848863
A_Bl_Comp -0.049673 0.007241 -0.412558 -0.085502 0.504654 0.280974 0.043355 0.177234 0.094253 -0.221256 ... 0.586441 0.476924 0.599294 0.176808 0.532767 0.530516 0.530516 0.533147 0.521256 0.527381
A_Ot_Comp -0.068421 -0.039911 -0.158386 -0.033241 0.564519 0.205831 0.006007 0.086602 0.043834 -0.109430 ... 0.531361 0.250143 0.476221 0.063678 0.198661 0.199819 0.199819 0.241285 0.247745 0.198661
A_Hi_Comp -0.065505 -0.029620 -0.343323 -0.090848 0.568691 0.310721 -0.001666 0.185382 0.050173 -0.197408 ... 0.626309 0.417654 0.615965 0.153872 0.444369 0.446943 0.446943 0.498729 0.472176 0.444369
A_Wh_BirthDef -0.028210 -0.051162 -0.636564 -0.076685 0.436266 0.183831 -0.170034 0.210200 0.189369 -0.325595 ... 0.526048 0.325084 0.477642 0.186702 0.798312 0.800125 0.800125 0.727041 0.784277 0.798723
A_Bl_BirthDef -0.070206 -0.004281 -0.324732 -0.066874 0.441627 0.241096 0.052955 0.161699 0.075325 -0.190849 ... 0.496117 0.381038 0.499453 0.139546 0.413022 0.415488 0.415488 0.427418 0.415907 0.413022
A_Ot_BirthDef -0.080423 -0.040049 -0.140461 -0.037398 0.557984 0.148024 0.010556 0.111951 0.020887 -0.110582 ... 0.526707 0.207901 0.456649 0.062429 0.176435 0.177469 0.177469 0.236012 0.219806 0.176435
A_Hi_BirthDef -0.064606 -0.023227 -0.327105 -0.091647 0.557867 0.237694 0.010509 0.193692 0.039305 -0.192930 ... 0.607026 0.383143 0.603240 0.125247 0.423338 0.425793 0.425793 0.473590 0.469789 0.423338
B_Wh_Injury -0.038297 -0.033007 -0.443647 -0.093575 0.454671 0.168282 -0.095020 0.191817 0.098323 -0.235704 ... 0.536729 0.387989 0.523343 0.195774 0.581022 0.580252 0.580252 0.571518 0.583139 0.579948
B_Bl_Injury -0.052670 -0.003053 -0.275449 -0.062219 0.511274 0.246902 0.056870 0.116420 0.051028 -0.138798 ... 0.553361 0.554407 0.628306 0.170941 0.349121 0.340399 0.340399 0.396778 0.365568 0.338431
B_Ot_Injury -0.058671 -0.030789 -0.082704 -0.027904 0.301956 0.042021 0.055743 0.120535 -0.010346 -0.088581 ... 0.277110 0.074339 0.262254 0.026641 0.110358 0.111021 0.111021 0.146918 0.129003 0.110358
B_Hi_Injury -0.062971 -0.003602 -0.203791 -0.062322 0.526490 0.229442 0.075082 0.173901 -0.005684 -0.139484 ... 0.527444 0.327578 0.544732 0.152983 0.252386 0.253846 0.253846 0.319813 0.304050 0.252386
B_Wh_Cancer -0.036816 -0.057486 -0.391911 -0.103118 0.544201 0.238166 -0.126559 0.136772 0.098151 -0.198710 ... 0.608287 0.370350 0.563034 0.120092 0.523763 0.519882 0.519882 0.534496 0.521106 0.520605
B_Bl_Cancer 0.001103 -0.011633 -0.176170 -0.039895 0.344796 0.238466 0.000316 0.043919 0.055531 -0.084845 ... 0.369426 0.337097 0.370706 0.054105 0.220471 0.221746 0.221746 0.286345 0.206110 0.220471
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Elderly_Medicare -0.060462 -0.061846 -0.414316 -0.096123 0.975891 0.359282 -0.063062 0.085859 0.078726 -0.139912 ... 0.966849 0.541208 0.893566 0.193929 0.509482 0.508928 0.508928 0.534937 0.529253 0.506585
Disabled_Medicare -0.063724 -0.059331 -0.453836 -0.101623 0.951566 0.377399 -0.012484 0.112783 0.106722 -0.181366 ... 0.958829 0.593034 0.915940 0.223595 0.552455 0.549973 0.549973 0.568117 0.571947 0.547860
Prim_Care_Phys_Rate 0.039469 0.054575 -0.367165 -0.024644 0.239476 0.240694 -0.136083 -0.041162 0.148229 -0.108946 ... 0.290729 0.219749 0.276624 0.078987 0.403602 0.402937 0.402937 0.355692 0.381977 0.399838
Dentist_Rate 0.033332 0.046989 -0.351491 0.012996 0.224938 0.200731 -0.238312 -0.040990 0.051971 -0.043253 ... 0.268165 0.159774 0.238770 0.075403 0.357683 0.357216 0.357216 0.316674 0.333881 0.354731
Community_Health_Center_Ind 0.056379 0.080133 0.248034 0.110544 -0.206536 -0.091928 -0.202551 -0.141301 -0.107687 0.184092 ... -0.252749 -0.166207 -0.246000 -0.108273 -0.336761 -0.337101 -0.337101 -0.295736 -0.334612 -0.336761
HPSA_Ind -0.011824 -0.028053 -0.389554 -0.041094 0.140932 0.071885 -0.249281 -0.017005 0.111069 -0.092270 ... 0.168443 0.102670 0.143557 0.045709 0.323029 0.323986 0.323986 0.270331 0.312196 0.323029
ALE 0.068292 -0.116058 -0.118039 0.044560 0.077616 0.021072 -0.471124 -0.108697 -0.032374 0.075062 ... 0.075702 -0.007150 0.034283 -0.011957 0.111958 0.113758 0.113758 0.100475 0.110672 0.112776
US_ALE 0.047215 -0.035831 -0.034478 -0.008621 0.007616 0.003705 0.011083 0.018128 -0.066277 0.038159 ... 0.009372 0.005630 0.007932 0.004786 0.018494 0.018558 0.018558 0.015051 0.017841 0.018494
All_Death -0.041145 0.155984 0.111050 -0.065630 -0.123908 -0.053258 0.584501 0.201753 0.009334 -0.118428 ... -0.124655 -0.019767 -0.079639 0.029102 -0.152033 -0.154043 -0.154043 -0.137859 -0.152477 -0.152132
US_All_Death 0.026305 0.014827 0.012575 0.023392 0.002964 0.002619 0.030108 -0.020632 -0.022925 0.034206 ... 0.006767 0.003980 0.005652 0.003368 -0.002057 -0.002090 -0.002090 0.001196 -0.001652 -0.002057
Health_Status -0.071532 -0.156841 -0.205133 -0.056476 0.049070 0.027627 0.195638 0.021773 0.034609 -0.029180 ... 0.063393 0.045426 0.066223 0.015387 0.141615 0.141772 0.141772 0.096021 0.142762 0.141105
Unhealthy_Days -0.069151 -0.193268 -0.314015 -0.059104 0.094914 0.039982 0.075005 0.027332 0.121240 -0.109416 ... 0.119506 0.066915 0.106453 0.039471 0.254984 0.255696 0.255696 0.200098 0.246679 0.255032
No_HS_Diploma -0.054755 -0.036577 -0.296432 -0.067311 0.962787 0.347032 0.011493 0.111999 0.055542 -0.139897 ... 0.924458 0.458433 0.861879 0.177288 0.374627 0.374615 0.374615 0.416840 0.407747 0.373185
Unemployed -0.051741 -0.046191 -0.363763 -0.076678 0.984176 0.343902 -0.034334 0.128555 0.092031 -0.186129 ... 0.965569 0.528721 0.900925 0.218553 0.464577 0.465283 0.465283 0.508441 0.494589 0.463534
Sev_Work_Disabled -0.040673 -0.019162 -0.135636 -0.030011 0.442383 0.136569 0.015841 0.040079 0.022660 -0.051163 ... 0.417018 0.284334 0.385539 0.060136 0.170579 0.170439 0.170439 0.208749 0.178581 0.169699
Major_Depression -0.054328 -0.057759 -0.397738 -0.085458 0.993877 0.354087 -0.063995 0.119975 0.109446 -0.193299 ... 0.981313 0.543348 0.907616 0.219391 0.510661 0.510812 0.510812 0.554413 0.537864 0.508889
Recent_Drug_Use -0.073000 -0.060053 -0.353783 -0.078437 0.986504 0.316758 -0.051403 0.111245 0.096933 -0.176216 ... 0.962669 0.455990 0.883418 0.169594 0.453268 0.453795 0.453795 0.497024 0.483818 0.451829
Ecol_Rpt -0.010785 -0.078457 -0.386568 -0.075030 0.604387 0.159648 -0.161764 0.098793 0.139697 -0.208381 ... 0.626674 0.344481 0.567667 0.136968 0.473712 0.474900 0.474900 0.462097 0.475607 0.472871
Ecol_Exp -0.048630 -0.062575 -0.428539 -0.084986 0.982927 0.334131 -0.119662 0.129492 0.129380 -0.219203 ... 0.966532 0.518470 0.888216 0.208826 0.524909 0.525337 0.525337 0.558961 0.552803 0.523819
Salm_Rpt -0.027260 -0.043089 -0.425264 -0.098026 0.680106 0.445019 -0.045468 0.112150 0.106774 -0.180110 ... 0.723490 0.742114 0.698815 0.211120 0.532435 0.532681 0.532681 0.520582 0.529117 0.530456
Salm_Exp -0.063681 -0.055910 -0.465205 -0.107958 0.978261 0.342724 -0.054216 0.151166 0.122207 -0.227688 ... 1.000000 0.567064 0.944355 0.235507 0.583842 0.583461 0.583461 0.613932 0.604673 0.581096
Shig_Rpt -0.006557 0.029384 -0.297500 -0.062619 0.523775 0.277823 0.029800 0.131822 0.053788 -0.151378 ... 0.567064 1.000000 0.621932 0.168984 0.377316 0.376931 0.376931 0.401853 0.385331 0.375404
Shig_Exp -0.061807 -0.034715 -0.419904 -0.088583 0.901652 0.335209 0.015400 0.161122 0.093824 -0.211261 ... 0.944355 0.621932 1.000000 0.218626 0.535356 0.532349 0.532349 0.563673 0.559581 0.529744
Toxic_Chem -0.013879 -0.019483 -0.155351 -0.068168 0.204323 0.039236 -0.022027 0.134323 0.018651 -0.120355 ... 0.235507 0.168984 0.218626 1.000000 0.219927 0.217413 0.217413 0.249636 0.222776 0.219943
Carbon_Monoxide_Ind -0.016436 -0.075387 -0.793963 -0.128768 0.477616 0.221469 -0.188169 0.184280 0.248541 -0.352608 ... 0.583842 0.377316 0.535356 0.219927 1.000000 0.997508 0.997508 0.874168 0.967427 0.995015
Nitrogen_Dioxide_Ind -0.014068 -0.075417 -0.795967 -0.129497 0.477760 0.222360 -0.189415 0.185081 0.249407 -0.354018 ... 0.583461 0.376931 0.532349 0.217413 0.997508 1.000000 1.000000 0.874082 0.969951 0.997508
Sulfur_Dioxide_Ind -0.014068 -0.075417 -0.795967 -0.129497 0.477760 0.222360 -0.189415 0.185081 0.249407 -0.354018 ... 0.583461 0.376931 0.532349 0.217413 0.997508 1.000000 1.000000 0.874082 0.969951 0.997508
Ozone_Ind -0.043496 -0.078794 -0.683131 -0.104967 0.522135 0.195464 -0.169929 0.181667 0.224395 -0.331683 ... 0.613932 0.401853 0.563673 0.249636 0.874168 0.874082 0.874082 1.000000 0.866830 0.871056
Particulate_Matter_Ind -0.027528 -0.073277 -0.770920 -0.129573 0.509329 0.213169 -0.166007 0.197109 0.233078 -0.350170 ... 0.604673 0.385331 0.559581 0.222776 0.967427 0.969951 0.969951 0.866830 1.000000 0.967427
Lead_Ind -0.014104 -0.075084 -0.793512 -0.126564 0.475792 0.221272 -0.189819 0.184625 0.249517 -0.353629 ... 0.581096 0.375404 0.529744 0.219943 0.995015 0.997508 0.997508 0.871056 0.967427 1.000000

194 rows × 194 columns

Well, this is too much information about our big dataframe! To know about the important correlated variables we need more cleaning;

I'll make another small dataframe with 3 columns: column one and two--> correlated pairs and column 3 the magnitude of their correlations. To make it easier to see the highly correlated parameters I sort them. I also defined a threshold=0.7 to cut low-correlated ones.

In [33]:
not_helping_columns = ['Sulfur_Dioxide_Ind', 
                       'Nitrogen_Dioxide_Ind', 'Lead_Ind', 'Carbon_Monoxide_Ind', 'Particulate_Matter_Ind', 'Time']
attrs = df.drop(columns=not_helping_columns).corr()
threshold = 0.6
high_corrs = (attrs[abs(attrs) > threshold][attrs != 1.0]).unstack().dropna().to_dict()
df_corrs = pd.DataFrame(list((sorted(key)[0], sorted(key)[1], high_corrs [key]) for key in high_corrs),
                        columns=['attribute_1','attribute_2', 'correlation'])
df_corrs = df_corrs.iloc[abs(df_corrs['correlation']).argsort()[::-1]].drop_duplicates().reset_index(drop=True)
df_corrs.head(20)
Out[33]:
attribute_1 attribute_2 correlation
0 Major_Depression Population_Size 0.993877
1 Population_Size Total_Births 0.993762
2 Pert_Exp Population_Size 0.993672
3 Elderly_Medicare Total_Deaths 0.993192
4 HepA_Exp HepB_Exp 0.992540
5 No_HS_Diploma Uninsured 0.991927
6 HepA_Exp Population_Size 0.989902
7 Major_Depression Pert_Exp 0.989281
8 Population_Size Recent_Drug_Use 0.986504
9 Pert_Exp Total_Births 0.986449
10 HepA_Exp Total_Births 0.986186
11 HepA_Exp Pert_Exp 0.985803
12 HepA_Exp Major_Depression 0.984434
13 Population_Size Unemployed 0.984176
14 Major_Depression Total_Births 0.984092
15 Total_Births Unemployed 0.984065
16 HepB_Exp Population_Size 0.983862
17 Disabled_Medicare Total_Deaths 0.983111
18 Ecol_Exp Population_Size 0.982927
19 HepB_Exp Syphilis_Exp 0.982537

Alright! It looks better!

In [55]:
# If we want to make a df with tuples of correlated pairs:
# df_corrs = pd.DataFrame(list(set([(tuple(sorted(key)), high_corrs [key])
#                                   for key in high_corrs])), columns=['attribute_pair', 'correlation'])
In [34]:
#Here we plot the top 6 most correlated parameters:
colors = ['b', 'g', 'r', 'c', 'm', 'y']
markers = ['H', 'v', 's', 'P', '*', 'D']
fig, axs = plt.subplots(3,2, figsize=(12,15), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace=0.4, wspace=0.5)
fig.suptitle('Plots of highly-correlated parameters', y=0.97, fontsize=30)
axs = axs.ravel()

for i in range(6):
    axs[i].scatter(df[df_corrs.loc[i+1,'attribute_1']], df[df_corrs.loc[i+1,'attribute_2']], 
                   marker=markers[i], color=colors[i])
    axs[i].set_xlabel(short_description(df_corrs.loc[i+1,'attribute_1']), fontsize=14)
    axs[i].set_ylabel(short_description(df_corrs.loc[i+1,'attribute_2']), fontsize=14)
    axs[i].tick_params(axis='x', labelsize=10)
    axs[i].tick_params(axis='y', labelsize=10)
In [35]:
for i in Death:
    print(df_corrs[df_corrs['attribute_1']==i].head(10))
     attribute_1    attribute_2  correlation
162  Brst_Cancer    E_Wh_Cancer     0.886133
175  Brst_Cancer    F_Wh_Cancer     0.880687
179  Brst_Cancer     Col_Cancer     0.876601
190  Brst_Cancer      Premature     0.871133
194  Brst_Cancer            LBW     0.866469
195  Brst_Cancer    Lung_Cancer     0.865634
199  Brst_Cancer  F_Wh_HeartDis     0.863822
207  Brst_Cancer  E_Wh_HeartDis     0.859256
237  Brst_Cancer           VLBW     0.831433
238  Brst_Cancer         Stroke     0.831162
    attribute_1    attribute_2  correlation
110         CHD  F_Wh_HeartDis     0.935667
142         CHD  E_Wh_HeartDis     0.899126
171         CHD    Lung_Cancer     0.883059
183         CHD      Premature     0.875345
204         CHD            LBW     0.859882
224         CHD     Col_Cancer     0.843544
230         CHD    F_Wh_Cancer     0.835635
252         CHD      Unmarried     0.820675
255         CHD    E_Wh_Cancer     0.818792
264         CHD           VLBW     0.812496
    attribute_1    attribute_2  correlation
170  Col_Cancer    Lung_Cancer     0.883116
177  Col_Cancer    E_Wh_Cancer     0.876780
184  Col_Cancer    F_Wh_Cancer     0.873506
192  Col_Cancer  E_Wh_HeartDis     0.868112
205  Col_Cancer  F_Wh_HeartDis     0.859775
208  Col_Cancer      Premature     0.858899
218  Col_Cancer            LBW     0.850530
233  Col_Cancer         Stroke     0.833689
258  Col_Cancer           VLBW     0.816654
273  Col_Cancer      Unmarried     0.803242
    attribute_1       attribute_2  correlation
479    Homicide         Unmarried     0.724589
484    Homicide  RHI_Homicide_Ind     0.723311
554    Homicide              VLBW     0.697721
594    Homicide          Under_18     0.684708
602    Homicide               LBW     0.682850
712    Homicide         Premature     0.652485
720    Homicide  Infant_Mortality     0.650414
867    Homicide       IM_Neonatal     0.610523
919    Homicide    IM_Bl_Non_Hisp     0.603218
925    Homicide   IM_Postneonatal     0.602134
        attribute_1             attribute_2  correlation
229  IM_Bl_Non_Hisp  RHI_IM_Bl_Non_Hisp_Ind     0.836600
451  IM_Bl_Non_Hisp                    VLBW     0.732362
500  IM_Bl_Non_Hisp        Infant_Mortality     0.717898
503  IM_Bl_Non_Hisp                     LBW     0.716869
509  IM_Bl_Non_Hisp             IM_Neonatal     0.715536
581  IM_Bl_Non_Hisp               Premature     0.688660
624  IM_Bl_Non_Hisp               Unmarried     0.678101
810  IM_Bl_Non_Hisp             Lung_Cancer     0.622090
928  IM_Bl_Non_Hisp                  Stroke     0.601380
    attribute_1      attribute_2  correlation
240     IM_Hisp  RHI_IM_Hisp_Ind     0.830289
     attribute_1       attribute_2  correlation
83   IM_Neonatal  Infant_Mortality     0.957536
167  IM_Neonatal              VLBW     0.884262
214  IM_Neonatal               LBW     0.854181
225  IM_Neonatal    IM_Wh_Non_Hisp     0.839856
241  IM_Neonatal         Premature     0.830115
262  IM_Neonatal         Unmarried     0.814709
287  IM_Neonatal       Lung_Cancer     0.791710
333  IM_Neonatal            Stroke     0.772439
535  IM_Neonatal   IM_Postneonatal     0.704508
598  IM_Neonatal            Injury     0.683374
         attribute_1       attribute_2  correlation
202  IM_Postneonatal  Infant_Mortality     0.861712
335  IM_Postneonatal    IM_Wh_Non_Hisp     0.771664
355  IM_Postneonatal         Unmarried     0.762027
400  IM_Postneonatal               LBW     0.748177
402  IM_Postneonatal         Premature     0.747661
405  IM_Postneonatal              VLBW     0.746402
481  IM_Postneonatal       Lung_Cancer     0.724177
574  IM_Postneonatal            Stroke     0.691351
591  IM_Postneonatal          Under_18     0.685518
634  IM_Postneonatal            Injury     0.674898
        attribute_1       attribute_2  correlation
176  IM_Wh_Non_Hisp  Infant_Mortality     0.878677
286  IM_Wh_Non_Hisp       Lung_Cancer     0.793334
338  IM_Wh_Non_Hisp         Premature     0.770844
342  IM_Wh_Non_Hisp               LBW     0.769292
366  IM_Wh_Non_Hisp              VLBW     0.757914
389  IM_Wh_Non_Hisp            Stroke     0.750930
464  IM_Wh_Non_Hisp         Unmarried     0.728337
561  IM_Wh_Non_Hisp            Injury     0.695708
566  IM_Wh_Non_Hisp  Strata_ID_Number    -0.693857
568  IM_Wh_Non_Hisp           Suicide     0.693612
          attribute_1       attribute_2  correlation
146  Infant_Mortality              VLBW     0.897124
182  Infant_Mortality               LBW     0.875407
209  Infant_Mortality         Premature     0.858678
213  Infant_Mortality         Unmarried     0.854232
251  Infant_Mortality       Lung_Cancer     0.820784
285  Infant_Mortality            Stroke     0.795222
459  Infant_Mortality            Injury     0.729428
527  Infant_Mortality          Under_18     0.707677
603  Infant_Mortality         Ozone_Ind     0.682789
652  Infant_Mortality  Strata_ID_Number    -0.669357
    attribute_1       attribute_2  correlation
300      Injury         Premature     0.785381
305      Injury       Lung_Cancer     0.782913
328      Injury               LBW     0.773255
343      Injury           Suicide     0.767469
345      Injury         Unmarried     0.766347
427      Injury            Stroke     0.739471
492      Injury              VLBW     0.720905
647      Injury         Ozone_Ind     0.670945
781      Injury  Strata_ID_Number    -0.629603
790      Injury          Under_18     0.628107
     attribute_1       attribute_2  correlation
144  Lung_Cancer         Premature     0.897972
187  Lung_Cancer            Stroke     0.871812
221  Lung_Cancer              VLBW     0.847087
227  Lung_Cancer         Unmarried     0.838080
350  Lung_Cancer         Ozone_Ind     0.764632
384  Lung_Cancer  Strata_ID_Number    -0.753092
414  Lung_Cancer           Suicide     0.743307
681  Lung_Cancer          Under_18     0.659504
794  Lung_Cancer           Over_40     0.627460
    attribute_1  attribute_2  correlation
691         MVA  RHI_MVA_Ind     0.658178
855         MVA     Under_18     0.612438
    attribute_1 attribute_2  correlation
249      Stroke        VLBW     0.821874
281      Stroke   Unmarried     0.798352
556      Stroke     Suicide     0.697454
752      Stroke    Under_18     0.638731
    attribute_1 attribute_2  correlation
587     Suicide   Unmarried     0.687304
769     Suicide        VLBW     0.632481
      attribute_1 attribute_2  correlation
56   Total_Deaths  Unemployed     0.967528
126  Total_Deaths   Uninsured     0.917039
In [36]:
#Now the most correlated parameters with heart diseas:
num_plots = 16
colors = cm.tab20(np.linspace(0, 1, num_plots))
fig, axs = plt.subplots(int(num_plots/2),2, figsize=(12,40), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace=0.4, wspace=0.5)
fig.suptitle('Plots of most correlated parameters with heart disease', y=0.92, fontsize=30)
axs = axs.ravel()

corr_params = list(df_corrs[df_corrs['attribute_1']=='CHD']['attribute_2'])
for i in range(num_plots):
    axs[i].scatter(df['CHD'], df[corr_params[i]], 
                   marker='*', color=colors[i])
    axs[i].set_xlabel(short_description('CHD'), fontsize=14)
    axs[i].set_ylabel(short_description(corr_params[i]), fontsize=14)
    axs[i].tick_params(axis='x', labelsize=10)
    axs[i].tick_params(axis='y', labelsize=10)
In [37]:
#Now the most correlated parameters with breast cancer:
num_plots = 10
colors = cm.tab20(np.linspace(0, 1, num_plots))
fig, axs = plt.subplots(int(num_plots/2),2, figsize=(12,25), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace=0.4, wspace=0.5)
fig.suptitle('Plots of most correlated parameters with breast cancer', y=0.92, fontsize=30)
axs = axs.ravel()

corr_params = list(df_corrs[df_corrs['attribute_1']=='Brst_Cancer']['attribute_2'])
for i in range(num_plots):
    axs[i].scatter(df['CHD'], df[corr_params[i]], 
                   marker='*', color=colors[i])
    axs[i].set_xlabel(short_description('Brst_Cancer'), fontsize=14)
    axs[i].set_ylabel(short_description(corr_params[i]), fontsize=14)
    axs[i].tick_params(axis='x', labelsize=10)
    axs[i].tick_params(axis='y', labelsize=10)

Okk! Correlated parameters are great and can be discussed in statistical inference section. Now how about looking at the data from States point of view:

In [38]:
#A new dataframe based on States:
df_state = df.groupby('CHSI_State_Abbr').sum().reset_index() 
df_state.head()
Out[38]:
CHSI_State_Abbr State_FIPS_Code County_FIPS_Code Strata_ID_Number Number_Counties Population_Size Population_Density Poverty Age_19_Under Age_19_64 ... Salm_Exp Shig_Rpt Shig_Exp Toxic_Chem Carbon_Monoxide_Ind Nitrogen_Dioxide_Ind Sulfur_Dioxide_Ind Ozone_Ind Particulate_Matter_Ind Lead_Ind
0 AK 54 4162 1957 946 663661 213.0 327.9 795.1 1718.6 ... 118.857143 26.821429 61.107143 8.371644e+07 4.642857 4.642857 4.642857 4.642857 4.642857 4.642857
1 AL 67 4489 2745 2489 4557808 5736.0 1144.7 1689.5 4063.9 ... 971.142857 609.821429 555.357143 4.291972e+07 16.857143 16.357143 16.357143 18.357143 16.357143 16.357143
2 AR 375 5625 3993 2753 2779154 3877.0 1334.5 1866.1 4469.7 ... 589.821429 359.500000 269.000000 1.140120e+07 14.500000 14.500000 14.500000 15.250000 14.500000 14.500000
3 AZ 60 208 513 491 5939292 714.0 252.6 422.2 847.8 ... 945.000000 869.000000 645.071429 1.833658e+07 5.285714 5.285714 5.285714 6.535714 6.535714 5.285714
4 CA 348 3364 1201 2174 36132147 36531.0 746.4 1505.6 3564.9 ... 5618.250000 0.000000 3304.464286 2.311464e+07 22.285714 22.285714 22.285714 35.821429 25.571429 22.285714

5 rows × 195 columns

In [39]:
sns.set(style='whitegrid')
fig, ax = plt.subplots(1, figsize=(20, 10))
plt.bar(df_state['CHSI_State_Abbr'], df_state['Homicide'], color='b')
plt.xlim(-1, len(df_state.index))
plt.ylabel('The sum of homicide in all the counties of each state', fontsize=16)
plt.xticks(fontsize=12)
plt.show()
In [40]:
sns.reset_defaults()
us_state = gpd.read_file('../US-States/states.shp')
us_state_land = us_state[ (us_state['STATE_NAME']!='Alaska') 
                     &(us_state['STATE_NAME']!='Hawaii')]
us_state_Alaska = us_state[us_state['STATE_NAME']=='Alaska']
us_state_Hawaii = us_state[us_state['STATE_NAME']=='Hawaii']
us_state.head()
#Merging map and data:

df_state_land = us_state_land.set_index('STATE_ABBR').join(df_state.set_index('CHSI_State_Abbr'))
df_state_Alaska = us_state_Alaska.set_index('STATE_ABBR').join(df_state.set_index('CHSI_State_Abbr'))
df_state_Hawaii = us_state_Hawaii.set_index('STATE_ABBR').join(df_state.set_index('CHSI_State_Abbr'))

def min_max_col2(variable):
    minimum = min(df_state_land[variable].min(), 
              df_state_Alaska[variable].min(), df_state_Hawaii[variable].min())
    maximum = max(df_state_land[variable].max(), 
              df_state_Alaska[variable].max(), df_state_Hawaii[variable].max())
    return minimum, maximum

def US_state_plot(variable, color='Blues'):
    h, w = figaspect(1.)
    mn, mx = min_max_col2(variable)
    fig, ax1 = plt.subplots(1, figsize=(w*10, h*10))
    ax1.axis('off')
    ax1.set_title(short_description(variable), fontsize=40)
    divider = make_axes_locatable(ax1)
    cax = divider.append_axes('right', size='2%', pad=0.1)
    cax.tick_params(labelsize=25)
    df_state_land.plot(column=variable, cmap=color,
                        linewidth=0.8, ax=ax1,
                        edgecolor='0.8', vmin=mn ,vmax=mx,
                        legend=True, cax=cax)

    
    left, bottom, width, height = [0.15, 0.22, 0.24, 0.3] 
    ax2 = fig.add_axes([left, bottom, width, height])
    ax2.axis('off')
    ax2.set_title('Alaska', fontsize=25)
    df_state_Alaska.plot(column=variable, cmap=color, vmin=mn ,vmax=mx,
                          linewidth=0.8, ax=ax2, edgecolor='0.6')
    
    left, bottom, width, height = [0.35, 0.3, 0.2, 0.05]
    ax3 = fig.add_axes([left, bottom, width, height])
    ax3.axis('off')
    ax3.set_title('Hawaii', fontsize=25)
    df_state_Hawaii.plot(column =variable, cmap=color, vmin=mn ,vmax=mx,
                          linewidth=0.8, ax=ax3, edgecolor='0.6')
In [41]:
df_state_land['Homicide_per_capita'] = df_state_land['Homicide'].div(df_state_land['Population_Size'])
df_state_Alaska['Homicide_per_capita'] = df_state_Alaska['Homicide'].div(df_state_Alaska['Population_Size'])
df_state_Hawaii['Homicide_per_capita'] = df_state_Hawaii['Homicide'].div(df_state_Hawaii['Population_Size'])
new_item = pd.DataFrame({'COLUMN_NAME':['Homicide_per_capita'], 'DESCRIPTION':['Homicide per capita']})
DATA_ELEMENT_DESCRIPTION = DATA_ELEMENT_DESCRIPTION.append(new_item, sort=False)

US_state_plot('Homicide_per_capita', 'OrRd')
In [42]:
df_state_land['Suicide_per_capita'] = df_state_land['Suicide']/df_state_land['Population_Size']
df_state_Alaska['Suicide_per_capita'] = df_state_Alaska['Suicide']/df_state_Alaska['Population_Size']
df_state_Hawaii['Suicide_per_capita'] = df_state_Hawaii['Suicide']/df_state_Hawaii['Population_Size']
new_item = pd.DataFrame({'COLUMN_NAME':['Suicide_per_capita'], 'DESCRIPTION':['Suicide per capita']})
DATA_ELEMENT_DESCRIPTION = DATA_ELEMENT_DESCRIPTION.append(new_item, sort=False)


US_state_plot('Suicide_per_capita', 'BuPu')
In [93]:
#Relationship between total death, being unmarried and being uninsured:
sns.set(style='whitegrid')
fig, ax = plt.subplots(1, figsize=(6, 5))
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
ax = sns.scatterplot(x='Unemployed', y='Uninsured', size='Total_Deaths', hue='Total_Deaths', 
                     sizes=(20, 200), palette=cmap, data=df_state)
ax.set_xlim(0)
ax.set_ylim(0)
Out[93]:
(0, 6709031.721550715)
In [100]:
sns.reset_orig()
f, ax = plt.subplots(figsize=(5, 4))
ax = sns.regplot(x='Unmarried', y='Infant_Mortality', data=df_state)
ax.set(ylim=(0, None))
ax.set(xlim=(0, None))
ax.set_ylabel('Infant Mortality')
Out[100]:
Text(0, 0.5, 'Infant Mortality')